15 research outputs found

    Multi-modal Dense Video Captioning

    Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most previous work in dense video captioning is based solely on visual information and completely ignores the audio track. However, audio, and speech in particular, are vital cues for a human observer trying to understand an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how the audio and speech modalities may improve a dense video captioning model. We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize the recently proposed Transformer architecture to convert multi-modal input data into textual descriptions. We demonstrate the performance of our model on the ActivityNet Captions dataset. The ablation studies indicate a considerable contribution from the audio and speech components, suggesting that these modalities contain substantial information complementary to the video frames. Furthermore, we provide an in-depth analysis of the ActivityNet Captions results by leveraging the category tags obtained from the original YouTube videos. Code is publicly available: github.com/v-iashin/MDVC
    Comment: To appear in the proceedings of CVPR Workshops 2020; Code: https://github.com/v-iashin/MDVC; Project Page: https://v-iashin.github.io/mdv
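    To make the machine-translation formulation above concrete, here is a minimal PyTorch sketch of a multi-modal encoder-decoder captioner. It is not the authors' MDVC implementation: the feature dimensions, the late fusion by concatenation, and all names are illustrative assumptions, and positional encodings are omitted for brevity.

    import torch
    import torch.nn as nn

    class MultiModalCaptioner(nn.Module):
        """Sketch: one Transformer encoder per modality, fusion by
        concatenating encoder outputs, and a causal caption decoder."""
        def __init__(self, vid_dim=1024, aud_dim=128, vocab=10000,
                     d_model=256, n_heads=4, n_layers=2):
            super().__init__()
            self.vid_proj = nn.Linear(vid_dim, d_model)
            self.aud_proj = nn.Linear(aud_dim, d_model)
            enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.vid_enc = nn.TransformerEncoder(enc, n_layers)
            self.aud_enc = nn.TransformerEncoder(enc, n_layers)
            self.tok_emb = nn.Embedding(vocab, d_model)
            dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec, n_layers)
            self.head = nn.Linear(d_model, vocab)

        def forward(self, vid_feats, aud_feats, caption_tokens):
            # Encode each modality independently, then fuse along time.
            memory = torch.cat([self.vid_enc(self.vid_proj(vid_feats)),
                                self.aud_enc(self.aud_proj(aud_feats))], dim=1)
            tgt = self.tok_emb(caption_tokens)
            t = tgt.size(1)  # causal mask so each token sees only its past
            causal = torch.triu(torch.full((t, t), float('-inf')), diagonal=1)
            return self.head(self.decoder(tgt, memory, tgt_mask=causal))

    # Toy shapes: 2 clips, 30 video frames, 50 audio frames, 12 caption tokens.
    model = MultiModalCaptioner()
    logits = model(torch.randn(2, 30, 1024), torch.randn(2, 50, 128),
                   torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])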

    Multi-modal Video Content Understanding

    Video is an important format of information. Humans use videos for a variety of purposes such as entertainment, education, communication, information sharing, and capturing memories. To date, humankind has accumulated a colossal amount of freely available video material online. Manual processing at this scale is simply impossible, and many research efforts have therefore been dedicated to the automatic processing of video content. At the same time, human perception of the world is multi-modal: a human uses multiple senses to understand the environment, objects, and their interactions. When watching a video, we perceive the content via both the audio and visual modalities, and removing one of them results in a less immersive experience. Similarly, if the information in the two modalities does not correspond, it may create a sense of dissonance. Therefore, joint modelling of multiple modalities (such as audio, visual, and text) within one model is an active research area. In the last decade, the fields of automatic video understanding and multi-modal modelling have seen exceptional progress due to the ubiquitous success of deep learning models and, more recently, transformer-based architectures in particular. Our work draws on these advances and pushes the state of the art of multi-modal video understanding forward.

    Applications of automatic multi-modal video processing are broad and exciting! For instance, the content-based textual description of a video (video captioning) may allow a visually or hearing-impaired person to understand the content and thus engage in richer social interactions. However, prior work in video content description relies on the visual input alone, missing vital information only available in the audio stream. To this end, we proposed two novel multi-modal transformer models that encode audio and visual interactions simultaneously. First, we introduced a late-fusion multi-modal transformer that is highly modular and allows the processing of an arbitrary set of modalities. Second, an efficient bi-modal transformer was presented that encodes audio-visual cues starting from the lower network layers, allowing richer audio-visual features and, as a result, stronger performance.

    Another application is automatic visually guided sound generation, which might help professional sound (foley) designers who spend hours searching a database for audio relevant to a movie scene. Previous approaches to automatic conditional audio generation support only one class (e.g. "dog barking"), while real-life applications may require generation for hundreds of data classes, and training one model per class can be infeasible. To bridge this gap, we introduced a novel two-stage model that first efficiently encodes audio as a set of codebook vectors (i.e. learns to make "building blocks") and then learns to sample these audio vectors given visual inputs to produce an audio track relevant to that visual input. Moreover, we studied the automatic evaluation of the conditional audio generation model and proposed metrics that measure both the quality and the relevance of the generated samples.

    Finally, as video editing is becoming more common among non-professionals due to the increased popularity of services such as YouTube, automatic assistance during video editing grows in demand, e.g. detecting when the audio and visual tracks are out of sync. Prior work in audio-visual synchronization was devoted to solving the task on lip-syncing datasets with "dense" signals, such as interviews and presentations. In such videos, synchronization cues occur "densely" across time, and it is enough to process just a few tenths of a second to synchronize the tracks. In contrast, open-domain videos mostly have only "sparse" cues that occur just once in a seconds-long video clip (e.g. "chopping wood"). To address this, we (a) proposed a novel dataset with "sparse" sounds and (b) designed a model which can efficiently encode seconds-long audio-visual tracks into a small set of "learnable selectors" that is then used for synchronization. In addition, we explored the temporal artefacts that common audio and video compression algorithms leave in data streams; to prevent a model from learning to rely on these artefacts, we introduced a list of recommendations on how to mitigate them. This thesis provides the details of the proposed methodologies as well as a comprehensive overview of advances in relevant fields of multi-modal video understanding. In addition, we provide a discussion of potential research directions that can bring significant contributions to the field.

    Metal-free activation of C-H bonds by boron trifluoride

    C–H activation is a challenging problem in modern organic chemistry, and direct C–H borylation is one of the fastest-growing subclasses of C–H activation. As a rule, these reactions are performed by transition-metal catalysis; recently, however, metal-free approaches to C–B bond formation have been developing intensively. Usually, metal-free borylations employ a boron compound as the Lewis acid component and a Lewis base as the proton acceptor, which may or may not be preorganized for the transformation. Such reactions typically require boranes of high Lewis acidity, such as B(C6F5)3, BCl3, or BBr3. At the same time, the chemistry of the less acidic boron trifluoride, BF3, as a borylating species is unprecedented. This work is aimed at uncovering the reactivity of BF3 towards C–H borylation of Csp–H and Csp2–H bonds. The following factors were studied:
    • Formation of BF3 adducts with various amines and their reactivity in Csp2–H and Csp–H borylation reactions
    • Scope of the borylation: the influence of the substrates' electronic structure and the compatibility of various functional groups
    • Control over the formation of mono-, bis-, tris-, and tetrakis-organoborates from BF3, an amine, and an Rsp–H/Rsp2–H substrate
    • Differences in reactivity between BF3·SMe2, BF3·OEt2, and BF3·1,2,2,6,6-pentamethylpiperidine (BF3·PMP) in alkyne borylation
    Because organoboranes are often unstable, reactive species, they were converted to fluoroborates with tetramethylammonium fluoride. In this context, the competing protodeborylation and fluorination reactions of organofluoroboranes were also studied. The literature review consists of two parts: metal-free borylation of triple bonds and of double bonds. For triple bonds, the review is largely limited to terminal acetylenes, because internal alkynes cannot undergo Csp–H activation. The influence of each component of the Lewis pair, and of its structure, on the selectivity for C–H borylation, 1,2-addition, and carboboration is discussed. The second part covers C–H borylation of Csp2–H bonds, including both concerted borylations and borylations by reactive borenium cations, with a special emphasis on the chemistry of haloboranes. To limit the size of the review, the use of hydroboranes for C–H activation is covered only to a limited extent.
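    As a schematic illustration of the overall sequence studied in the thesis (a hedged sketch: the amine, stoichiometry, and byproducts shown are simplified assumptions, not exact equations from the work), a terminal-alkyne borylation followed by stabilization as a trifluoroborate can be written as:

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Simplified, assumed scheme: amine-assisted Csp-H borylation by BF3,
    % then trapping of the reactive alkynyl-BF2 as an air-stable trifluoroborate.
    \begin{align*}
      \mathrm{R{-}C{\equiv}C{-}H} + \mathrm{BF_3{\cdot}NR'_3}
        &\longrightarrow \mathrm{R{-}C{\equiv}C{-}BF_2} + \mathrm{[R'_3NH]^{+}F^{-}} \\
      \mathrm{R{-}C{\equiv}C{-}BF_2} + \mathrm{Me_4NF}
        &\longrightarrow \mathrm{[Me_4N]^{+}[R{-}C{\equiv}C{-}BF_3]^{-}}
    \end{align*}
    \end{document}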

    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

    Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting visual features alone, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate their importance only on a dataset from a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as a part of the bi-modal transformer, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding performance. The code is available: v-iashin.github.io/bmt
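    To make the bi-modal encoding idea concrete, here is a minimal PyTorch sketch of one encoder layer in which each modality both self-attends and cross-attends to the other from the lower layers onward. This is an illustrative assumption of the mechanism, not the paper's exact layer; all dimensions and names are hypothetical.

    import torch
    import torch.nn as nn

    class BiModalEncoderLayer(nn.Module):
        """Sketch: audio and visual streams exchange information via
        cross-attention inside every encoder layer."""
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.self_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.self_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm_a = nn.LayerNorm(d_model)
            self.norm_v = nn.LayerNorm(d_model)

        def forward(self, a, v):
            # Self-attention within each modality.
            a = a + self.self_a(a, a, a)[0]
            v = v + self.self_v(v, v, v)[0]
            # Cross-attention: audio queries visual keys/values and vice versa.
            a2 = self.norm_a(a + self.cross_av(a, v, v)[0])
            v2 = self.norm_v(v + self.cross_va(v, a, a)[0])
            return a2, v2

    layer = BiModalEncoderLayer()
    a, v = layer(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
    print(a.shape, v.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 30, 256])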

    Tetramethylammonium Fluoride : Fundamental Properties and Applications in C-F Bond-Forming Reactions and as a Base

    Nucleophilic ionic sources of fluoride are essential reagents in the synthetic toolbox to access high added-value fluorinated building blocks unattainable by other means. In this review, we provide a concise description and rationale of the outstanding features of one of these reagents, tetramethylammonium fluoride (TMAF), as well as disclosing the different methods for its preparation, and how its physicochemical properties and solvation effects in different solvents are intimately associated with its reactivity. Furthermore, herein we also comprehensively describe its historic and recent utilization, up to December 2021, in C-F bond-forming reactions with special emphasis on nucleophilic aromatic substitution fluorinations with a potential sustainable application in industrial settings, as well as its use as a base capable of rendering unprecedented transformations.
    Keywords: tetramethylammonium fluoride; TMAF; solvation effects; nucleophilic fluorination; sustainable industrial fluorination; SNAr; [18F]-radiolabelling; superbases; selective methylation; fluorinated excited species
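    As a generic illustration of the SNAr fluorinations surveyed in this review (a hedged, schematic example; the substrate and leaving group are placeholders, not specific cases from the text):

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Generic nucleophilic aromatic substitution with TMAF; the electron-poor
    % arene Ar and the leaving group X are illustrative placeholders.
    \begin{equation*}
      \mathrm{Ar{-}X} + \mathrm{Me_4N^{+}F^{-}}
        \longrightarrow \mathrm{Ar{-}F} + \mathrm{Me_4N^{+}X^{-}}
      \qquad (X = \mathrm{NO_2,\ Cl,\ Br,\ \ldots})
    \end{equation*}
    \end{document}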

    Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-robot Handovers

    Human-robot object handover is a key skill for the future of human-robot collaboration. The CORSMAL 2020 Challenge focuses on the perception part of this problem: the robot needs to estimate the filling mass of a container held by a human. Although there are powerful methods for image processing and audio processing individually, answering such a problem requires processing data from multiple sensors together. The appearance of the container, the sound of the filling, and the depth data all provide essential information. We propose a multi-modal method to predict three key indicators of the filling mass: the filling type, the filling level, and the container capacity. These indicators are then combined to estimate the filling mass of a container. Our method obtained the Top-1 overall performance among all submissions to the CORSMAL 2020 Challenge on both the public and private subsets, while showing no evidence of overfitting. Our source code is publicly available: https://github.com/v-iashin/CORSMAL; Docker: https://hub.docker.com/r/iashin/corsma
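    The combination step admits a simple closed form: the estimated mass is the container capacity times the predicted fill fraction times the density of the predicted filling type. A minimal sketch follows; the density values and class names are illustrative assumptions, not the challenge's constants.

    # Hedged sketch of combining the three predicted indicators into a mass.
    # Densities are illustrative assumptions, not the challenge's constants.
    DENSITY_G_PER_ML = {"water": 1.00, "rice": 0.85, "pasta": 0.41, "empty": 0.00}

    def filling_mass_g(filling_type: str, filling_level: float,
                       capacity_ml: float) -> float:
        """mass = container capacity * predicted fill fraction * filling density."""
        return capacity_ml * filling_level * DENSITY_G_PER_ML[filling_type]

    print(filling_mass_g("water", 0.5, 300.0))  # 150.0 g for a half-full 300 ml cup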

    Taming Visually Guided Sound Generation

    Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
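    To sketch the second stage described above (a transformer sampling codebook indices conditioned on video features), here is a hypothetical, minimal PyTorch example. It is not the SpecVQGAN code: all dimensions and names are illustrative assumptions, and the first-stage codebook training and waveform decoding are omitted.

    import torch
    import torch.nn as nn

    class CodebookSampler(nn.Module):
        """Autoregressively samples spectrogram codebook indices conditioned
        on video features (a sketch of the idea, not SpecVQGAN itself)."""
        def __init__(self, n_codes=1024, vid_dim=512, d_model=256,
                     n_heads=4, n_layers=2):
            super().__init__()
            self.vid_proj = nn.Linear(vid_dim, d_model)
            self.code_emb = nn.Embedding(n_codes, d_model)
            dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec, n_layers)
            self.head = nn.Linear(d_model, n_codes)

        @torch.no_grad()
        def sample(self, vid_feats, n_steps=16, bos=0):
            memory = self.vid_proj(vid_feats)  # condition on video features
            codes = torch.full((vid_feats.size(0), 1), bos, dtype=torch.long)
            for _ in range(n_steps):
                h = self.decoder(self.code_emb(codes), memory)
                probs = self.head(h[:, -1]).softmax(-1)
                nxt = torch.multinomial(probs, 1)  # sample next codebook index
                codes = torch.cat([codes, nxt], dim=1)
            return codes[:, 1:]  # indices to be decoded into a spectrogram

    sampler = CodebookSampler()
    print(sampler.sample(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 16])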

    Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

    The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams. (ii) We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task. (iii) We curate a dataset with only sparse in time and space synchronisation signals; and (iv) the effectiveness of the proposed model is shown on both dense and sparse datasets quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync
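    The 'selectors' idea can be sketched as a handful of learnable query vectors that cross-attend over the long concatenated audio-visual sequence, after which a small head classifies the temporal offset. A minimal, hypothetical PyTorch illustration (not the paper's architecture; all sizes and names are assumptions):

    import torch
    import torch.nn as nn

    class SelectorPooling(nn.Module):
        """Distils long audio+visual sequences into a few learnable 'selector'
        tokens via cross-attention (a sketch of the idea, not the paper's model)."""
        def __init__(self, d_model=256, n_selectors=8, n_heads=4, n_offsets=21):
            # n_offsets: number of discrete offset candidates (arbitrary choice here)
            super().__init__()
            self.selectors = nn.Parameter(torch.randn(1, n_selectors, d_model))
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.head = nn.Linear(n_selectors * d_model, n_offsets)

        def forward(self, audio_tokens, visual_tokens):
            stream = torch.cat([audio_tokens, visual_tokens], dim=1)  # long sequence
            q = self.selectors.expand(stream.size(0), -1, -1)
            pooled, _ = self.attn(q, stream, stream)  # selectors attend over both streams
            return self.head(pooled.flatten(1))  # logits over temporal offsets

    model = SelectorPooling()
    logits = model(torch.randn(2, 500, 256), torch.randn(2, 250, 256))
    print(logits.shape)  # torch.Size([2, 21])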

    Metal-Free C-H Borylation of N-Heteroarenes by Boron Trifluoride

    Organoboron compounds are essential reagents in modern C–C coupling reactions. Their synthesis via catalytic C–H borylation by main-group elements is emerging as a powerful alternative to transition-metal catalysis. Herein, a straightforward metal-free synthesis of aryldifluoroboranes from BF3 and heteroarenes is reported. The reaction is assisted by sterically hindered amines and catalytic amounts of thioureas. According to computational studies, the reaction proceeds via a frustrated Lewis pair (FLP) mechanism. The obtained aryldifluoroboranes are further stabilized against destructive protodeborylation by converting them to the corresponding air-stable tetramethylammonium organotrifluoroborates.
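    A schematic view of the two steps reported in the abstract (a hedged sketch: the amine, byproducts, and stoichiometry are simplified assumptions, not taken verbatim from the paper):

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Assumed simplified scheme: amine-assisted, thiourea-catalysed C-H
    % borylation of a heteroarene by BF3, then stabilization with Me4NF.
    \begin{align*}
      \mathrm{Het{-}H} + \mathrm{BF_3{\cdot}NR_3}
        &\xrightarrow{\text{thiourea (cat.)}}
        \mathrm{Het{-}BF_2} + \mathrm{[R_3NH]^{+}F^{-}} \\
      \mathrm{Het{-}BF_2} + \mathrm{Me_4NF}
        &\longrightarrow \mathrm{[Me_4N]^{+}[Het{-}BF_3]^{-}}
    \end{align*}
    \end{document}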